Alibaba Tongyi strikes again: the Qwen3 omni-modal model family keeps upgrading and leads the industry in multimodal retrieval

2026-01-12

At the start of 2026, the Alibaba Tongyi Qianwen (Qwen) team continues to push the pace of AI innovation. Following the release of the upgraded Qwen3-Omni-Flash-2025-12-01 omni-modal large model in December 2025, the team announced on January 8 that the Qwen3-VL Embedding and Qwen3-VL Reranker models had been open-sourced, built specifically for multimodal information retrieval and cross-modal understanding. This marks a new stage in Tongyi Qianwen's technology roadmap for multimodal AI, giving developers a complete solution that spans content understanding through to accurate retrieval.

Qwen3-Omni-Flash fully upgraded: a leap in audio-video interaction

The Qwen3-Omni-Flash-2025-12-01 version released by the Alibaba Tongyi Qwen team in December 2025 is a comprehensive upgrade of the natively omni-modal large model Qwen3-Omni. The new version seamlessly processes text, image, audio, and video inputs, and generates both text and natural speech output with real-time streaming responses.

The upgrade has four core highlights. First, the audio-video interaction experience has been improved across the board: the model's understanding and execution of audio and video instructions is markedly stronger, the common problem of the model "getting dumber" in colloquial, spoken-language scenarios is effectively resolved, and the stability and coherence of multi-turn audio-video conversations have improved noticeably, making interactions feel more natural and fluid.

Second, System Prompt control has taken a leap forward. The new version fully opens up System Prompt customization, enabling fine-grained control over model behavior: persona style (e.g., sweet-girl, "royal sister", or Japanese-style personas), preferences for colloquial expression, and reply-length requirements can all be implemented accurately, greatly improving controllability.
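As a rough illustration of how such behavior steering looks in practice, the sketch below sets a persona and reply-length constraint through a system message on an OpenAI-compatible chat endpoint. The base URL, model identifier, and message contents are assumptions for demonstration only and are not taken from the announcement.

```python
# Hypothetical sketch: steering model behavior with a System Prompt via an
# OpenAI-compatible chat endpoint. The base_url and model name are assumptions
# for illustration, not confirmed values from this article.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model identifier
    messages=[
        # The system prompt defines persona, tone, and reply length.
        {"role": "system",
         "content": "You are a cheerful assistant. Answer in a casual, spoken style, "
                    "and keep every reply under two sentences."},
        {"role": "user", "content": "What's the weather like for a picnic this weekend?"},
    ],
)
print(response.choices[0].message.content)
```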

Third, multilingual instruction following is more reliable. The model supports interaction in 119 text languages, speech recognition in 19 languages, and speech synthesis in 10 languages. The unstable language following seen in the previous version has been thoroughly addressed, ensuring accurate and consistent responses in cross-language scenarios.

Fourth, speech generation is more human-like and fluent. The new version eliminates the sluggish pacing and mechanical stiffness of earlier output and significantly improves the model's ability to adapt speaking rate, pauses, and rhythm to the content of the text. The resulting speech is natural and expressive, approaching the quality of human conversation.

On objective metrics, the omni-modal capabilities of Qwen3-Omni-Flash-2025-12-01 have improved significantly. Text understanding and generation are stronger, with clear gains in logical reasoning (ZebraLogic +5.6), code generation (LiveCodeBench-v6 +9.3, MultiPL-E +2.7), and comprehensive writing (WriteBench +2.2). Speech understanding is more accurate, with a markedly lower word error rate in speech recognition and a 3.2-point gain on the speech dialogue evaluation. Image understanding is deeper, with breakthroughs in multidisciplinary visual question answering (MMMU +4.7, MMMU_de +4.8) and mathematical visual reasoning (Mathview_full +2.2). Video understanding is more coherent, and video semantic comprehension continues to be optimized, laying a solid foundation for real-time video conversations.

A new benchmark for multimodal retrieval: the Qwen3-VL Embedding and Reranker series goes open source

On January 10, 2026, the Tongyi Qianwen team released another heavyweight pair of open-source models: the Qwen3-VL Embedding and Qwen3-VL Reranker series. Both are built on Qwen3-VL and designed specifically for multimodal information retrieval and cross-modal understanding, reaching industry-leading levels on authoritative benchmarks.

The Qwen3-VL Embedding model uses a dual-tower architecture and can process inputs containing text, images, screenshots, and video within a unified framework. Leveraging the strengths of the underlying Qwen3-VL model, it produces semantically rich vector representations that capture visual and textual information in a shared space, enabling efficient cross-modal similarity computation and retrieval.
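In a dual-tower setup, queries and documents are encoded independently and compared in the shared vector space, typically with cosine similarity. The minimal sketch below illustrates that comparison using random placeholder vectors in place of real model outputs; the vector dimension is an assumption chosen only for the example.

```python
# Sketch of cross-modal similarity in a shared embedding space.
# In a dual-tower model, queries and documents are encoded separately;
# with L2-normalized vectors, cosine similarity reduces to a dot product.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder embeddings: one query vector and five document vectors.
# In practice the documents could be images, screenshots, video, or text.
query_vec = l2_normalize(rng.normal(size=(1, 1024)))   # assumed dimension of 1024
doc_vecs = l2_normalize(rng.normal(size=(5, 1024)))

scores = (doc_vecs @ query_vec.T).squeeze(-1)   # cosine similarities
ranking = np.argsort(-scores)                   # best match first
print(ranking, scores[ranking])
```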

The Qwen3-VL Reranker model uses a single-tower architecture and complements the Embedding model. It takes a (query, document) pair as input, where both the query and the document may be any single modality or a mix of modalities, and outputs a precise relevance score. In practical retrieval scenarios the two models work together: the Embedding model handles the initial recall stage, and the Reranker model handles the reranking stage. This two-stage process significantly improves final retrieval accuracy.
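A minimal sketch of that recall-then-rerank flow is shown below. The embed() and rerank_score() callables are hypothetical wrappers standing in for the Embedding and Reranker models; they are assumptions for illustration, not the official API.

```python
# Two-stage multimodal retrieval sketch: embedding-based recall, then reranking.
# embed() and rerank_score() are hypothetical stand-ins for the Qwen3-VL
# Embedding and Reranker models, not an official interface.
from typing import Any, Callable, Sequence
import numpy as np

def retrieve(
    query: Any,
    documents: Sequence[Any],
    embed: Callable[[Sequence[Any]], np.ndarray],   # returns L2-normalized vectors
    rerank_score: Callable[[Any, Any], float],      # relevance score for (query, doc)
    recall_k: int = 100,
    final_k: int = 10,
) -> list:
    # Stage 1: embedding-based recall over the whole corpus (cheap, dual-tower).
    doc_vecs = embed(documents)
    query_vec = embed([query])[0]
    candidates = np.argsort(-(doc_vecs @ query_vec))[:recall_k]

    # Stage 2: rerank only the recalled candidates (accurate, single-tower).
    scored = [(i, rerank_score(query, documents[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [documents[i] for i, _ in scored[:final_k]]
```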

Evaluation data show that the Qwen3-VL-Embedding-8B model achieves industry-leading results on the MMEB-V2 benchmark, surpassing previous open-source models and closed-source commercial services, with strong performance across diverse tasks such as image-text retrieval, video-text matching, visual question answering, and multimodal content clustering. All Qwen3-VL-Reranker models consistently outperform the base Embedding model and baseline reranker models, with the 8B version achieving the best results on most tasks.

The open-source ecosystem keeps expanding, and developers worldwide share the dividends

The Tongyi Qianwen team continues to invest in its open-source strategy. Following the open-source release of the text-oriented Qwen3 Embedding and Qwen3 Reranker series in June 2025, this multimodal release gives developers a complete toolchain, from foundational understanding to precise retrieval, for building end-to-end multimodal AI applications.

The newly released models inherit the multilingual capability of Qwen3-VL, supporting more than 30 languages and suiting global applications. They offer flexible choice of vector dimensions, customizable task instructions, and strong performance even after vector quantization, so developers can easily integrate them into existing pipelines for applications that demand strong cross-language and cross-modal understanding.
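Flexible vector dimensions are commonly exploited by truncating the full embedding to the desired length and re-normalizing, a Matryoshka-style trick that shrinks the index footprint. The snippet below sketches that operation under the assumption that the model's flexible-dimension support follows this scheme; the model card should be consulted for the sanctioned method.

```python
# Sketch: reducing embedding dimensionality by truncate-and-renormalize.
# Assumes the flexible-dimension support follows the common Matryoshka-style
# scheme; check the model documentation for the recommended approach.
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(1).normal(size=2048)   # placeholder full-size embedding
full /= np.linalg.norm(full)
compact = truncate_embedding(full, 256)              # smaller vector for the index
print(compact.shape)
```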

Since 2023, the Alibaba Tongyi team has open-sourced more than 300 models, spanning two flagship series: the Qwen large language models and the Wan (Wanxiang) visual generation models. The open-source release of the Qwen3-VL Embedding and Reranker series is an important step in Tongyi Qianwen's exploration of unified multimodal representation and retrieval, and marks the start of Alibaba's systematic push along the multimodal AI roadmap.

The Tongyi team said that open-sourcing Qwen3-VL Embedding and Qwen3-VL Reranker is a new starting point, and that it looks forward to working with the global developer community to explore and build more general, unified multimodal retrieval capabilities and to advance the development and practical adoption of multimodal AI. With the Qwen3-Omni-Flash and Qwen3-Omni-Flash-Realtime model upgrades completed on the Bailian platform on January 5, 2026, developers can now experience the latest advances more conveniently and bring AI innovation to a wide range of application scenarios.